Abstract: The traditional way of storing facts as triplets (head_entity, relation, tail_entity), abbreviated as (h, r, t), makes knowledge intuitive to display and easy for humans to acquire, but hard for AI machines to compute with or reason over. Inspired by the success of applying Distributed Representations to AI-related fields, recent studies aim to represent each entity and relation with a unique low-dimensional embedding, which differs from the symbolic and atomic framework of displaying knowledge in triplets. In this way, knowledge computing and reasoning can be facilitated by a simple vector calculation, i.e. $h + r \approx t$. We thus contribute an effective model that learns better embeddings satisfying this formula by pulling the positive tail entities $t^+$ together and close to $h + r$ (Nearest Neighbor), while simultaneously pushing the negatives $t^-$ away from the positives $t^+$ by keeping a Large Margin. We also design a corresponding learning algorithm that efficiently searches for the optimal solution based on Stochastic Gradient Descent in an iterative fashion. Quantitative experiments illustrate that our approach achieves state-of-the-art performance compared with several recent methods on benchmark datasets for two classical applications, i.e. Link prediction and Triplet classification. Moreover, we analyze the parameter complexities of all the evaluated models, and the analytical results indicate that our model needs fewer computational resources while outperforming the other methods.

I. Introduction 

Owing to long-term efforts to distill the explosive growth of information on the Web, several large-scale Knowledge Bases (KBs), such as WordNet¹, OpenCyc², YAGO³ and Freebase⁴, have already been built. Despite the different domains these KBs serve, nearly all of them concentrate on enriching entities and relations. To facilitate storing, displaying and even retrieving knowledge, we use a highly structured form, i.e. (head_entity, relation, tail_entity), for knowledge representation. Each triplet is called a fact. So far, some general-domain KBs, such as Freebase⁵ and YAGO⁶, contain millions of entities and billions of facts, and they should have led to a huge leap forward for some canonical AI-related tasks, e.g. Question Answering Systems. However, this symbolic and atomic framework of representing knowledge makes it difficult for most AI machines to use, especially those dominated by statistical approaches.

1 http://www.princeton.edu/wordnet

2 http://www.cyc.com/platform/opencyc

3 www.mpi-inf.mpg.de/yago-naga/yago

4 http://www.freebase.com

5 According to the statistics released by the official site, Freebase contains 43 million entities and 1.9 billion triplets for now.

6 As of 2012, YAGO2s has knowledge of more than 10 million entities and contains more than 120 million facts about these entities.

Recently, many AI-related fields, such as Image Classification, Natural Language Understanding and Speech Recognition, have made significant progress by learning Distributed Representations. For example, distributed representations of words applied to Statistical Language Modeling have achieved considerable success by placing similar words close to each other in low-dimensional vector spaces. Moreover, Mikolov et al. discovered the somewhat surprising pattern that the learnt word embeddings, to some extent, implicitly capture syntactic and semantic regularities in language. For example, the result of the vector calculation $v_{Madrid} - v_{Spain} + v_{France}$ is closer to $v_{Paris}$ than to any other word [11].

Inspired by the idea of word vector calculation, we look forward to making knowledge computable. If we consider the example mentioned above in the ideal case, the most probable reason for $v_{Madrid} - v_{Spain} + v_{France} \approx v_{Paris}$ is that capital_city_of is the relation between Madrid and Spain, and likewise between Paris and France. In other words, we can conclude that:

• There are two facts/triplets, i.e. (Madrid, capital_city_of, Spain) and (Paris, capital_city_of, France).

• We also derive the approximate equation $v_{Spain} - v_{Madrid} \approx v_{France} - v_{Paris}$ from $v_{Madrid} - v_{Spain} + v_{France} \approx v_{Paris}$.

The shared relation capital_city_of may help establish the approximate equation through a certain implicit connection. If we assume that the relation capital_city_of can also be explicitly represented by a vector, i.e. $v_{capital\_city\_of}$, the connection will be $v_{Spain} - v_{Madrid} \approx v_{France} - v_{Paris} \approx v_{capital\_city\_of}$. Therefore, the fact (Madrid, capital_city_of, Spain) can be modeled in another way, i.e. $v_{Madrid} + v_{capital\_city\_of} \approx v_{Spain}$.
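The argument above can be checked on a toy example. The sketch below uses hand-picked 2-D vectors (an assumption made purely for illustration, not learnt embeddings) in which both (capital, country) pairs share the same offset, so that the offset can play the role of the relation vector for capital_city_of.

```python
# Toy 2-D embeddings chosen by hand for illustration; they are NOT learnt vectors.
v = {
    "Madrid": (1.0, 3.0),
    "Spain": (2.0, 1.0),
    "Paris": (4.0, 3.5),
    "France": (5.0, 1.5),
}


def sub(a, b):
    """Component-wise a - b."""
    return tuple(x - y for x, y in zip(a, b))


def add(a, b):
    """Component-wise a + b."""
    return tuple(x + y for x, y in zip(a, b))


# Both pairs share the same offset, which acts as the capital_city_of vector.
r_from_spain = sub(v["Spain"], v["Madrid"])
r_from_france = sub(v["France"], v["Paris"])

# Translating Madrid by the offset derived from the French pair lands on Spain.
predicted_spain = add(v["Madrid"], r_from_france)
```

With real embeddings the two offsets would only agree approximately, which is why the relation is written with an approximate equality.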

Generally speaking, this paper explores a better approach to knowledge representation by learning a unique low-dimensional embedding for each entity and relation in KBs, so that each fact (head_entity, relation, tail_entity) can be represented by a simple vector calculation $h + r \approx t$. To achieve this goal, we contribute a generic model named Large Margin Nearest Neighbor Embedding (LMNNE). As intuitively shown in Figure 1, LMNNE follows the principle of pulling the positive tail entities $t^+$ close to $h + r$, while simultaneously pushing the negative tail entities $t^-$ a large margin $\gamma$ away from the positives $t^+$. The details of the model formulation are described in Section III. Section IV presents the algorithm that solves LMNNE efficiently based on Stochastic Gradient Descent (SGD) in an iterative fashion. To demonstrate both the effectiveness and the efficiency of the proposed model, we conduct quantitative experiments in Section V, i.e. evaluating the performance of Link prediction and Triplet classification on several benchmark datasets. We also perform a qualitative analysis by comparing model complexity among all related approaches in Section VI. The results demonstrate that LMNNE achieves state-of-the-art performance and demands fewer computational resources than many prior arts.

Fig. 1. LMNNE has two objectives optimized simultaneously: PULL the positive tail entities ($t^+$) close to $h + r$ and PUSH the negative tail entities ($t^-$) out of the margin $\gamma$. In this case, all the entities and relations are embedded into the 2D vector space and we use the $L_2$ norm to measure the distance.


II. Related Work 

All the related studies work on finding a better way of representing a fact/triplet. Usually, they design a scoring function $f_r(h, t)$ to measure the plausibility of a triplet $(h, r, t)$. The lower the dissimilarity given by $f_r(h, t)$, the more compatible the triplet $(h, r, t)$ is.

Unstructured [12] is a naive model which just exploits the occurrence information of the head and the tail entities without considering the relation between them. It defines the scoring function $\|h - t\|$, and this model obviously cannot discriminate entity pairs with different relations. Therefore, Unstructured is commonly regarded as the baseline approach.

Distance Model (SE) [13] uses a pair of matrices $(W_{rh}, W_{rt})$ to characterize a relation $r$. The dissimilarity of a triplet $(h, r, t)$ is calculated by the $L_1$ norm $\|W_{rh} h - W_{rt} t\|_1$. As pointed out by Socher et al. [14], the separate matrices $W_{rh}$ and $W_{rt}$ weaken the capability of capturing correlations between entities and the corresponding relations, even though the model takes the relations into consideration.


Single Layer Model, proposed by Socher et al. [14], aims at alleviating the shortcomings of the Distance Model by means of the nonlinearity of a single-layer neural network $g(W_{rh} h + W_{rt} t + b_r)$, in which $g = \tanh$. The linear output layer then gives the scoring function $u_r^T g(W_{rh} h + W_{rt} t + b_r)$.

Bilinear Model [15] is another model that tries to fix the weak interaction between the head and tail entities in the Distance Model, using a relation-specific bilinear form: $f_r(h, t) = h^T W_r t$.

Neural Tensor Network (NTN) [14] proposes a general scoring function: $f_r(h, t) = u_r^T g(h^T W_r t + W_{rh} h + W_{rt} t + b_r)$, which combines the Single Layer Model and the Bilinear Model. This model is more expressive, as the second-order correlations are also fed into the nonlinear transformation function, but the computational complexity is rather high.

TransE [12] is the state-of-the-art method so far. Differing from all the other prior arts, this approach embeds relations into the same vector space as entities by regarding the relation $r$ as a translation from $h$ to $t$, i.e. $h + r = t$. This model works well on facts with the ONE-TO-ONE mapping property, as minimizing the global loss function forces $h + r$ to equal $t$. However, facts with multi-mapping properties, i.e. MANY-TO-MANY, MANY-TO-ONE and ONE-TO-MANY, degrade the performance of the model. Given a series of facts associated with a ONE-TO-MANY relation $r$, e.g. $(h, r, t_1), (h, r, t_2), \ldots, (h, r, t_m)$, TransE tends to learn nearly identical embeddings for the entities on the MANY side, which can then hardly be discriminated. Moreover, the learning algorithm of TransE only considers randomly built negative triplets, which may introduce bias when embedding entities and relations.

Therefore, we propose a generic model (LMNNE) in 
the subsequent section to tackle the margin-based knowledge 
embedding problem. This model can fully take advantage of 
both positive and negative triplets, and in addition, be flexible 
enough when dealing with the multi-mapping properties. 






III. Proposed Model 

Given a triplet $(h, r, t)$, we use the following formula as the scoring function $f_r(h, t)$ to measure its plausibility, i.e.

$$f_r(h, t) = \|h + r - t\|, \tag{1}$$

where $\|\cdot\|$ stands for a generic distance metric ($L_1$ norm or $L_2$ norm), depending on the model selection.
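As a minimal sketch of the scoring function in Eq. (1), the following pure-Python function computes the dissimilarity of a triplet under either norm. The function name `score`, the `norm` argument and the toy vectors are assumptions introduced for illustration only.

```python
import math


def score(h, r, t, norm="L1"):
    """Dissimilarity f_r(h, t) = ||h + r - t|| under the L1 or the L2 norm."""
    diff = [hi + ri - ti for hi, ri, ti in zip(h, r, t)]
    if norm == "L1":
        return sum(abs(x) for x in diff)
    if norm == "L2":
        return math.sqrt(sum(x * x for x in diff))
    raise ValueError("norm must be 'L1' or 'L2'")


# Hypothetical 2-D embeddings: good_tail sits exactly at h + r, bad_tail does not.
h, r = [1.0, 0.0], [2.0, 2.0]
good_tail, bad_tail = [3.0, 2.0], [0.0, 6.0]
```

A lower score means a more plausible triplet, so `good_tail` scores lower than `bad_tail` under both norms.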


Ideally, we look forward to learning a unique low-dimensional vector representation for each entity and relation, so that all the triplets in the KB satisfy the equation $h + r = t$. However, this cannot be done perfectly because of the following multi-mapping reasons:

• Not all entity pairs are involved in only one relation. For example, (Barack Obama, president_of, U.S.A) and (Barack Obama, born_in, U.S.A) are both correct facts in the KB. However, we do not expect the embedding of the relation president_of to be the same as that of the relation born_in, since those relations are not semantically related.

• Besides the multi-relation property above, we may also face the problem of multiple head entities or tail entities when the other two elements are given. For example, there are at least five correct tail entities, i.e. Comedy film, Adventure film, Science fiction, Fantasy and Satire, given (WALL-E, has_the_genre, ?). Likewise, we do not want those tail entities to share the same embedding.

Therefore, we suggest a ‘soft’ way of modeling triplets in KBs. Suppose that $A$ is the set of facts in the KB. For each triplet $(h, r, t)$ in $A$, we build a set of reconstructed triplets $A'_{(h,r,t)}$ by replacing the head or the tail with all the other entities in turn. $A'_{(h,r,t)}$ can be divided into two sets, i.e. $A'^{+}_{(h,r,t)}$ and $A'^{-}_{(h,r,t)}$. $A'^{+}_{(h,r,t)}$ is the positive set of triplets reconstructed from $(h, r, t)$, as $A'^{+}_{(h,r,t)} \subset A$, and $A'^{-}_{(h,r,t)}$ is the negative set, because $A'^{-}_{(h,r,t)} \not\subset A$. The intuitive goal of our model is to learn embeddings such that the positive tail entities $t^+$ are closer to $h + r$ than any negative tail entity $t^-$. Therefore, the goal contains two objectives, i.e. pulling the positive neighbors near (NN) each other while pushing the negatives a large margin (LM) away.

Specifically, for a pair of positive triplets, $(h, r, t)$ and $(h^+, r, t^+)$, we pull the head or the tail entities together (left panel, Figure 1) by minimizing the loss of variances in distance, i.e.

$$\mathcal{L}_{pull} = \min \sum_{(h^+, r, t^+) \in A'^{+}_{(h,r,t)}} \left( \|h - h^+\| + \|t - t^+\| \right). \tag{2}$$

Simultaneously, we push the negative head or tail entities away (right panel, Figure 1) by keeping all the negative distances $f_r(h^-, t^-)$ at least one margin $\gamma$ farther than $f_r(h^+, t^+)$. Therefore, the objective function is

$$\mathcal{L}_{push} = \min \sum_{(h,r,t) \in A} \; \sum_{(h^-, r, t^-) \in A'^{-}_{(h,r,t)}} \left[ \gamma + f_r(h, t) - f_r(h^-, t^-) \right]_+, \tag{3}$$

where $[\cdot]_+$ is the hinge loss function, equivalent to $\max(0, \cdot)$, which can spot and update the negative triplets that violate the margin constraints.

Finally, we propose the Large Margin Nearest Neighbor Embedding model, which uses $\mu$ to control the trade-off between the pulling ($\mathcal{L}_{pull}$) and pushing ($\mathcal{L}_{push}$) operations by setting the total loss $\mathcal{L}$ as follows,

$$\mathcal{L} = \min \; \mu \mathcal{L}_{pull} + (1 - \mu) \mathcal{L}_{push}. \tag{4}$$
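The per-triplet terms of Eqs. (2)-(4) can be sketched as follows. This is an illustrative reading of the objective rather than the released implementation: the function names, the use of the $L_1$ norm for both terms and the toy vectors are assumptions, and `gamma` and `mu` denote the margin and the trade-off.

```python
def l1(u, v):
    """L1 distance between two vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def score(h, r, t):
    """f_r(h, t) = ||h + r - t|| under the L1 norm (assumed here)."""
    return sum(abs(a + b - c) for a, b, c in zip(h, r, t))


def pull_loss(h, t, h_pos, t_pos):
    """Term of Eq. (2) for one positive pair (h, r, t), (h+, r, t+)."""
    return l1(h, h_pos) + l1(t, t_pos)


def push_loss(h, r, t, h_neg, t_neg, gamma):
    """Hinge term of Eq. (3): zero once the negative is a margin farther away."""
    return max(0.0, gamma + score(h, r, t) - score(h_neg, r, t_neg))


def total_loss(pull, push, mu):
    """Eq. (4): trade-off between the pulling and the pushing terms."""
    return mu * pull + (1.0 - mu) * push


# Hypothetical embeddings: t_pos is another correct tail, t_neg a wrong one.
h, r, t = [0.0, 0.0], [1.0, 1.0], [1.0, 1.0]
t_pos, t_neg = [1.5, 1.0], [1.0, 2.0]
pull = pull_loss(h, t, h, t_pos)                # tail replaced, head unchanged
push = push_loss(h, r, t, h, t_neg, gamma=2.0)  # negative still inside the margin
loss = total_loss(pull, push, mu=0.6)
```

A negative that already lies more than one margin farther away than the positive contributes nothing to the push term, which is what the hinge expresses.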

IV. Learning Algorithm 

To efficiently search for the optimal solution of LMNNE, we use Stochastic Gradient Descent (SGD) to update the embeddings of entities and relations in an iterative fashion. As shown in Algorithm 1, we first initialize all the entities and relations following a uniform distribution. Each time we pick a triplet $(h, r, t)$ from $A$, an accompanying triplet $(h', r, t')$ is sampled at the same time by replacing the head or the tail with another entity from the entity set $E$, i.e. $A'_{(h,r,t)} = \{(h', r, t) \mid h' \in E\} \cup \{(h, r, t') \mid t' \in E\}$. Then we choose one of the pair-wise SGD-based updating options depending on which camp ($A'^{+}_{(h,r,t)}$ or $A'^{-}_{(h,r,t)}$) the triplet $(h', r, t')$ belongs to.
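The sampling step described above can be sketched as follows. The helper name `sample_reconstructed`, the even split between head and tail replacement, and the toy facts are assumptions made for illustration; the text only states that the head or the tail is replaced by another entity and that the update depends on whether the result is a known fact.

```python
import random


def sample_reconstructed(triplet, entities, facts, rng):
    """Replace the head or the tail of `triplet` with another entity.

    Returns the reconstructed triplet and a flag that is True when the
    reconstructed triplet is itself a known fact (pull case) and False
    otherwise (push case).
    """
    h, r, t = triplet
    if rng.random() < 0.5:  # assumed: head and tail are corrupted equally often
        candidates = [e for e in entities if e != h]
        reconstructed = (rng.choice(candidates), r, t)
    else:
        candidates = [e for e in entities if e != t]
        reconstructed = (h, r, rng.choice(candidates))
    return reconstructed, reconstructed in facts


# Toy knowledge base reusing the WALL-E example.
facts = {
    ("WALL-E", "has_the_genre", "Comedy film"),
    ("WALL-E", "has_the_genre", "Satire"),
}
entities = ["WALL-E", "Comedy film", "Satire", "Madrid"]
rng = random.Random(0)
picked, is_positive = sample_reconstructed(
    ("WALL-E", "has_the_genre", "Comedy film"), entities, facts, rng
)
```

Here replacing the tail with Satire yields a positive triplet, while any other replacement yields a negative one.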


Algorithm 1 The Learning Algorithm of LMNNE
Input:
    Training set $A = \{(h, r, t)\}$, entity set $E$, relation set $R$; dimension of embeddings $d$, margin $\gamma$, learning rates $\alpha$ and $\beta$ for $\mathcal{L}_{pull}$ and $\mathcal{L}_{push}$ respectively, convergence threshold $\epsilon$, maximum epochs $n$ and the trade-off $\mu$.

foreach $r \in R$ do
    r := Uniform(…)
    …
end foreach
foreach $e \in E$ do
    e := Uniform(…)
    …
end foreach
i := 0
while Rel.loss > $\epsilon$ and i < n do
    foreach $(h, r, t) \in A$ do
        $(h', r, t')$ := Sampling($A'_{(h,r,t)}$)
        if $(h', r, t') \in A'^{+}_{(h,r,t)}$ then
            Updating: $\nabla_{(h,r,t,h',t')} \mathcal{L}_{pull}$ with: $\alpha\mu$
        end if
        if $(h', r, t') \in A'^{-}_{(h,r,t)}$ then
            Updating: $\nabla_{(h,r,t,h',t')} \mathcal{L}_{push}$ with: $\beta(1 - \mu)$
        end if
    end foreach
end while
Output:
    All the embeddings of $e$ and $r$, where $e \in E$ and $r \in R$.
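The two updating branches of Algorithm 1 can be sketched as single subgradient steps. This is an illustrative reading under stated assumptions, not the released code: both losses use the $L_1$ norm, embeddings live in one plain dictionary keyed by hypothetical names, and the step sizes are `alpha * mu` for pulling and `beta * (1 - mu)` for pushing.

```python
def sign(x):
    """Subgradient of |x| (taken as 0 at x = 0)."""
    return (x > 0) - (x < 0)


def residual(emb, h, r, t):
    return [a + b - c for a, b, c in zip(emb[h], emb[r], emb[t])]


def l1_score(emb, h, r, t):
    return sum(abs(x) for x in residual(emb, h, r, t))


def hinge(emb, pos, neg, gamma):
    return max(0.0, gamma + l1_score(emb, *pos) - l1_score(emb, *neg))


def step(emb, name, grad, rate):
    emb[name] = [x - rate * g for x, g in zip(emb[name], grad)]


def push_update(emb, pos, neg, gamma, beta, mu):
    """One step on [gamma + f_r(h, t) - f_r(h', t')]_+ for a negative triplet."""
    if hinge(emb, pos, neg, gamma) <= 0:
        return False  # margin already satisfied, nothing to update
    (h, r, t), (h2, _, t2) = pos, neg
    g_pos = [sign(x) for x in residual(emb, h, r, t)]
    g_neg = [sign(x) for x in residual(emb, h2, r, t2)]
    rate = beta * (1.0 - mu)
    step(emb, h, g_pos, rate)
    step(emb, t, [-g for g in g_pos], rate)
    step(emb, r, [p - n for p, n in zip(g_pos, g_neg)], rate)
    step(emb, h2, [-g for g in g_neg], rate)
    step(emb, t2, g_neg, rate)
    return True


def pull_update(emb, pos, other, alpha, mu):
    """One step on ||h - h'|| + ||t - t'|| for a positive reconstructed triplet."""
    rate = alpha * mu
    for a, b in ((pos[0], other[0]), (pos[2], other[2])):
        g = [sign(x - y) for x, y in zip(emb[a], emb[b])]
        step(emb, a, g, rate)
        step(emb, b, [-v for v in g], rate)


# Hypothetical embeddings: t_pos is another correct tail, t_neg a wrong one.
emb = {"h": [0.0, 0.0], "r": [1.0, 0.0], "t": [1.0, 0.5],
       "t_pos": [1.0, 0.9], "t_neg": [1.0, 1.0]}
pos, neg, other = ("h", "r", "t"), ("h", "r", "t_neg"), ("h", "r", "t_pos")
hinge_before = hinge(emb, pos, neg, gamma=2.0)
push_update(emb, pos, neg, gamma=2.0, beta=0.1, mu=0.6)
hinge_after = hinge(emb, pos, neg, gamma=2.0)
gap_before = abs(emb["t"][1] - emb["t_pos"][1])
pull_update(emb, pos, other, alpha=0.1, mu=0.6)
gap_after = abs(emb["t"][1] - emb["t_pos"][1])
```

In this toy run the push step shrinks the hinge violation and the pull step moves the two correct tails closer together.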


V. Quantitative Experiments 

Embedding knowledge into low-dimensional vector spaces makes AI-related computing tasks much easier, such as Link prediction (predicting $t$ given $h$ and $r$, or $h$ given $r$ and $t$) and Triplet classification (discriminating whether a triplet $(h, r, t)$ is correct or wrong). Two recent studies [12], [14] used subsets of WordNet (WN) and Freebase (FB) data to evaluate their models and reported the performance on the two tasks respectively.

In order to conduct solid experiments, we compare our model (LMNNE) with many related studies, including state-of-the-art and baseline approaches, on the two tasks, i.e. Link prediction and Triplet classification. All the datasets, the source code and the learnt embeddings for entities and relations can be downloaded from http://ldrv.ms/lpUPwzP


A. Link prediction 

One of the benefits of knowledge embedding is that we can apply simple vector calculations to many reasoning tasks. For example, Link prediction is a valuable task that contributes to completing the knowledge graph. Specifically, it aims at predicting the missing entity or relation given the other two elements of an incomplete triplet.

With the help of knowledge embeddings, if we would like to tell whether the entity $h$ has the relation $r$ with the entity $t$, we just need to calculate the distance between $h + r$ and $t$. The closer they are, the more likely the triplet $(h, r, t)$ is to exist.

1) Datasets: Bordes et al. [12], [17] released two benchmark datasets⁷ which were extracted from WordNet (WN18) and Freebase (FB15K). Table I shows the statistics of these two datasets. FB15K is larger than WN18 in terms of triplets, with many more relations but fewer entities.

TABLE I. Statistics of the datasets used for link prediction task.

DATASET              WN18       FB15K
#(ENTITIES)          40,943     14,951
#(RELATIONS)         18         1,345
#(TRAINING EX.)      141,442    483,142
#(VALIDATING EX.)    5,000      50,000
#(TESTING EX.)       5,000      59,071


2) Evaluation Protocol: For each testing triplet, all the other entities that appear in the training set take turns replacing the head entity, which yields a set of candidate triplets associated with the testing triplet. The dissimilarity of each candidate triplet is first computed by the scoring function, the candidates are then sorted in ascending order, and finally the rank of the ground truth triplet is recorded. The whole procedure is repeated for the tail entity, so that we can report the mean results. We use two metrics, i.e. Mean Rank and Mean Hit@10 (the proportion of ground truth triplets that rank in the Top-10), to measure the performance. However, the results measured by these metrics are relatively inaccurate, as the procedure above tends to generate false negative triplets. In other words, some of the candidate triplets rank higher than the ground truth triplet simply because they are themselves correct facts that appear in the training set. We thus filter out those triplets to report more reasonable results (the Filter setting, as opposed to Raw).
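The protocol above can be sketched for the tail-replacement case as follows (head replacement is symmetric). The function names, the $L_1$ scoring and the 1-D toy embeddings are assumptions for illustration; passing `known_facts` switches from the raw to the filtered setting.

```python
def l1_score(emb, h, r, t):
    """f_r(h, t) = ||h + r - t|| under the L1 norm (assumed here)."""
    return sum(abs(a + b - c) for a, b, c in zip(emb[h], emb[r], emb[t]))


def tail_rank(emb, triplet, entities, known_facts=None):
    """Rank of the true tail among all candidate tails (1 is best).

    When known_facts is given, candidates that form a known correct
    triplet are skipped (filtered setting).
    """
    h, r, t = triplet
    true_score = l1_score(emb, h, r, t)
    rank = 1
    for e in entities:
        if e == t:
            continue
        if known_facts is not None and (h, r, e) in known_facts:
            continue  # the candidate is itself a correct fact
        if l1_score(emb, h, r, e) < true_score:
            rank += 1
    return rank


def mean_rank_and_hits(ranks, k=10):
    """Mean Rank and the proportion of ranks within the top k."""
    return sum(ranks) / len(ranks), sum(1 for x in ranks if x <= k) / len(ranks)


# Hypothetical 1-D embeddings: "a" and "b" are both correct tails of (h, r).
emb = {"h": [0.0], "r": [1.0], "a": [1.0], "b": [1.2], "c": [3.0]}
entities = ["h", "a", "b", "c"]
train_facts = {("h", "r", "a")}
raw = tail_rank(emb, ("h", "r", "b"), entities)
filtered = tail_rank(emb, ("h", "r", "b"), entities, train_facts)
```

In this toy case the correct tail "a" outranks the test tail "b" in the raw setting, and filtering it out restores rank 1, which is the false negative effect the protocol describes.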


7 The datasets can be downloaded from https://www.hds.utc.fr/everest/doku.php?id=en:transe


3) Experimental Results: We compare our model LMNNE with the state-of-the-art TransE and other models mentioned in [14] and [18], evaluated on WN18 and FB15K. We tune the parameters of each previous model⁸ on the validation set and select the combination of parameters which leads to the best performance. The results are almost the same as in [12]. For LMNNE, we tried several combinations of parameters: $d \in \{20, 50, 100\}$, $\gamma \in \{0.1, 1.0, 2.0, 10.0\}$, $\alpha \in \{0.01, 0.02, 0.05, 0.1, 0.5, 1.0\}$, $\beta \in \{0.01, 0.02, 0.05, 0.1, 0.5, 1.0\}$ and $\mu \in \{0.2, 0.4, 0.5, 0.6, 0.8\}$, and finally chose $d = 20$, $\gamma = 2.0$, $\alpha = \beta = 0.02$ and $\mu = 0.6$ for the WN18 dataset; $d = 50$, $\gamma = 1.0$, $\alpha = \beta = 0.02$ and $\mu = 0.6$ for the FB15K dataset. Moreover, we tried different distance metrics, such as the $L_1$ norm, the $L_2$ norm and the inner product, for the scoring function. Experiments show that the $L_2$ norm and the $L_1$ norm are the best choices for measuring the distances in $\mathcal{L}_{pull}$ and $\mathcal{L}_{push}$ respectively, for both datasets. Table II demonstrates that LMNNE outperforms all the prior arts, including the baseline model Unstructured [12], RESCAL, SE [13], SME (LINEAR), SME (BILINEAR), LFM [16] and the state-of-the-art TransE⁹ [17], [12], measured by Mean Rank and Mean Hit@10.

Moreover, we divide FB15K into different categories (i.e. ONE-TO-ONE, ONE-TO-MANY, MANY-TO-ONE and MANY-TO-MANY) based on the mapping properties of facts. Following TransE [12], we set 1.5 as the threshold to discriminate ONE from MANY. For example, if the average number of tails appearing in the dataset for a given pair $(h, r)$ is above 1.5, we categorize the triplets involving this relation $r$ into the ONE-TO-MANY class. We evaluate the Filter Hit@10 metric on each category. Table III shows that LMNNE performs best on most categories. The result suggests that the proposed approach not only maintains the ability to model ONE-TO-ONE triplets, but also better handles facts with multi-mapping properties.
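The categorization rule can be sketched as follows. The text only spells out the tail-side rule; applying the same 1.5 threshold to the average number of heads per $(r, t)$ pair is an assumption made here by symmetry, and the function name and toy triplets are hypothetical.

```python
from collections import defaultdict


def relation_categories(triplets, threshold=1.5):
    """Label each relation ONE/MANY on both sides using average fan-out."""
    tails = defaultdict(set)  # (h, r) -> distinct tails
    heads = defaultdict(set)  # (r, t) -> distinct heads
    for h, r, t in triplets:
        tails[(h, r)].add(t)
        heads[(r, t)].add(h)

    tails_per_head = defaultdict(list)
    heads_per_tail = defaultdict(list)
    for (h, r), ts in tails.items():
        tails_per_head[r].append(len(ts))
    for (r, t), hs in heads.items():
        heads_per_tail[r].append(len(hs))

    categories = {}
    for r in tails_per_head:
        avg_tails = sum(tails_per_head[r]) / len(tails_per_head[r])
        avg_heads = sum(heads_per_tail[r]) / len(heads_per_tail[r])
        head_side = "MANY" if avg_heads > threshold else "ONE"
        tail_side = "MANY" if avg_tails > threshold else "ONE"
        categories[r] = head_side + "-TO-" + tail_side
    return categories


toy_triplets = [
    ("WALL-E", "has_the_genre", "Comedy film"),
    ("WALL-E", "has_the_genre", "Satire"),
    ("WALL-E", "has_the_genre", "Fantasy"),
    ("Madrid", "capital_city_of", "Spain"),
    ("Paris", "capital_city_of", "France"),
]
categories = relation_categories(toy_triplets)
```

On this toy input, has_the_genre has three tails for its single head and is therefore labeled ONE-TO-MANY, while capital_city_of stays ONE-TO-ONE.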

B. Triplet classification 

Triplet classification is another task, proposed by Socher et al. [14], which focuses on searching for a relation-specific distance threshold $\sigma_r$ to tell whether a triplet $(h, r, t)$ is plausible.

1) Datasets: Similar to Bordes et al. [17], [12], Socher et al. [14] also constructed two standard datasets¹⁰, i.e. WN11 and FB13, sampled from WordNet and Freebase. However, both datasets contain far fewer relations. Therefore, we build another dataset based on FB15K, which has many more relations, following the principle proposed by Socher et al. [14]. The head or the tail entity can be randomly replaced with another one to produce a negative triplet, but in order to build much tougher validation and testing datasets, the principle emphasizes that the picked entity should have appeared at the same position before. For example, given a positive triplet (Pablo Picasso, nationality, Spanish), (Pablo Picasso, nationality, American) is a potential negative example rather than the obviously irrational (Pablo Picasso, nationality, Van Gogh), as American and Spanish are more common as the tails of nationality. Table IV shows the statistics of the standard datasets that we used for evaluating models on the triplet classification task.

8 All the code for the related models can be downloaded from https://github.com/glorotxa/SME

9 To conduct fair comparisons, we re-implemented TransE based on the SGD algorithm (not mini-batch SGD) and fed the same random embeddings for initializing both LMNNE and TransE. That’s the reason why our experimental results of TransE are slightly different from the original paper [12], [17].

10 Those datasets can be downloaded from the website http://www.socher.org/index.php

TABLE II. Link prediction results. We compared our proposed LMNNE with the state-of-the-art method (TransE) and other prior arts.

DATASET           WN18                                 FB15K
METRIC            MEAN RANK        MEAN HIT@10         MEAN RANK        MEAN HIT@10
                  Raw     Filter   Raw     Filter      Raw     Filter   Raw     Filter
Unstructured      315     304      35.3%   38.2%       1,074   979      4.5%    6.3%
RESCAL            1,180   1,163    37.2%   52.8%       828     683      28.4%   44.1%
SE                1,011   985      68.5%   80.5%       273     162      28.8%   39.8%
SME (LINEAR)      545     533      65.1%   74.1%       274     154      30.7%   40.8%
SME (BILINEAR)    526     509      54.7%   61.3%       284     158      31.3%   41.3%
LFM               469     456      71.4%   81.6%       283     164      26.0%   33.1%
TransE            294.4   283.2    70.4%   80.2%       243.3   139.9    36.7%   44.3%
LMNNE             257.3   245.4    73.7%   84.1%       221.2   107.4    38.3%   48.2%

TABLE III. Results of Filter Hit@10 (in %) on FB15K categorized by different mapping properties of facts (M. stands for MANY).

TASK              Predicting head                           Predicting tail
REL. MAPPING      1-TO-1   1-TO-M.   M.-TO-1   M.-TO-M.     1-TO-1   1-TO-M.   M.-TO-1   M.-TO-M.
Unstructured      34.5%    2.5%      6.1%      6.6%         34.3%    4.2%      1.9%      6.6%
SE                35.6%    62.6%     17.2%     37.5%        34.9%    14.6%     68.3%     41.3%
SME (LINEAR)      35.1%    53.7%     19.0%     40.3%        32.7%    14.9%     61.6%     43.3%
SME (BILINEAR)    30.9%    69.6%     19.9%     38.6%        28.2%    13.1%     76.0%     41.8%
TransE            43.7%    65.7%     18.2%     47.2%        43.7%    19.7%     66.7%     50.0%
LMNNE             59.2%    77.8%     17.5%     45.5%        58.6%    20.0%     80.9%     51.2%

TABLE IV. Statistics of the datasets used for triplet classification task.

DATASET              WN11       FB13       FB15K
#(ENTITIES)          38,696     75,043     14,951
#(RELATIONS)         11         13         1,345
#(TRAINING EX.)      112,581    316,232    483,142
#(VALIDATING EX.)    5,218      11,816     100,000
#(TESTING EX.)       21,088     47,466     118,142

2) Evaluation Protocol: The decision strategy for binary classification is simple: if the dissimilarity of a testing triplet $(h, r, t)$ computed by $f_r(h, t)$ is below the relation-specific threshold $\sigma_r$, it is predicted as positive, otherwise negative. The relation-specific threshold $\sigma_r$ can be searched by maximizing the classification accuracy on the validation triplets which belong to the relation $r$.
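The threshold search can be sketched as below for a single relation. The candidate set (midpoints between consecutive validation scores plus one value beyond each end) and the function name are assumptions; the text only requires that the chosen threshold maximize validation accuracy.

```python
def best_threshold(scored_examples):
    """Search a threshold for one relation.

    scored_examples is a list of (dissimilarity, is_positive) pairs taken
    from the validation triplets of that relation. A triplet is predicted
    positive when its dissimilarity is below the threshold. Returns the
    pair (threshold, accuracy).
    """
    scores = sorted({s for s, _ in scored_examples})
    # Assumed candidate set: midpoints plus one value beyond each end.
    candidates = [scores[0] - 1.0]
    candidates += [(a + b) / 2 for a, b in zip(scores, scores[1:])]
    candidates += [scores[-1] + 1.0]

    best = (candidates[0], -1.0)
    for sigma in candidates:
        correct = sum((s < sigma) == label for s, label in scored_examples)
        accuracy = correct / len(scored_examples)
        if accuracy > best[1]:
            best = (sigma, accuracy)
    return best


# Hypothetical validation scores for one relation.
validation = [(0.2, True), (0.4, True), (0.7, True), (0.9, False), (1.5, False)]
sigma_r, accuracy = best_threshold(validation)
```

Because the toy positives and negatives are separable, the search returns a threshold between 0.7 and 0.9 with perfect validation accuracy.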

3) Experimental Results: We use the best combination of parameter settings from the Link prediction task ($d = 20$, $\gamma = 2.0$, $\alpha = \beta = 0.02$ and $\mu = 0.6$ for the WN11 dataset; $d = 50$, $\gamma = 1.0$, $\alpha = \beta = 0.02$ and $\mu = 0.6$ for both the FB13 and FB15K datasets) to generate the entity and relation embeddings, and learn the best classification threshold $\sigma_r$ for each relation $r$. Compared with the state-of-the-art, i.e. TransE [12], [17], and other prior arts, such as Distance Model [13], Hadamard Model [20], Single Layer Model [14], Bilinear Model [15] and Neural Tensor Network (NTN)¹¹ [14], the proposed LMNNE achieves the best accuracy on WN11 and FB15K and improves over TransE on all three datasets, as shown in Table V.


11 Socher et al. reported higher classification accuracy in [14] with word embeddings. In order to conduct a fair comparison, the accuracy of NTN reported in Table V is the same as the EV (entity vectors) results in Figure 4 of [14].


TABLE V. The accuracy of triplet classification compared with the state-of-the-art method (TransE) and other prior arts.

DATASET               WN11     FB13     FB15K
Distance Model        53.0%    75.2%    -
Hadamard Model        70.0%    63.7%    -
Single Layer Model    69.9%    85.3%    -
Bilinear Model        73.8%    84.3%    -
NTN                   70.4%    87.1%    66.7%
TransE                77.5%    67.5%    85.8%
LMNNE                 78.6%    74.8%    86.8%


VI. Qualitative Analysis 

In addition to the quantitative experiments evaluating the performance on Link prediction and Triplet classification with several benchmark datasets, we also analytically compare the parameter complexity of the approaches that we have mentioned. Table VI lists the theoretical costs of representing triplets $(h, r, t)$ based on the scoring functions of nearly all the classical models. Except for Unstructured, TransE and LMNNE, the other approaches regard the relation $r$ as a transformation matrix applied to the entities $h$ and $t$. These models need more resources for storing and computing embeddings. Unstructured costs the least, but does not contain any information on relations. LMNNE and TransE embed both entities and relations into low-dimensional vector spaces from different points of view: LMNNE regards a relation as the inner connection of word embedding calculations, and TransE considers it as a kind of translation from one entity to another. Despite the different angles of modeling, both of them are relatively efficient models for knowledge representation.
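To make the comparison concrete, the parameter-complexity formulas of Table VI can be evaluated for a given dataset size. The sketch below plugs in the FB15K entity and relation counts and $d = 50$ from the experiments; the number of slices `s = 3` is a hypothetical value chosen only for illustration.

```python
def parameter_counts(n_e, n_r, d, s):
    """Number of parameters per model, following the Table VI formulas."""
    return {
        "Unstructured": n_e * d,
        "Distance Model (SE)": n_e * d + 2 * n_r * d * d,
        "Single Layer Model": n_e * d + 2 * n_r * (s * d + s),
        "Bilinear Model": n_e * d + n_r * d * d,
        "Neural Tensor Network (NTN)": n_e * d + n_r * (s * d * d + 2 * s * d + 2 * s),
        "TransE and LMNNE": n_e * d + n_r * d,
    }


# FB15K sizes from the dataset statistics; s = 3 is an assumed slice count.
counts = parameter_counts(n_e=14951, n_r=1345, d=50, s=3)
for model, n_params in sorted(counts.items(), key=lambda kv: kv[1]):
    print(f"{model:30s} {n_params:>12,d}")
```

Under these settings the vector-based models stay close to the Unstructured baseline, while the matrix- and tensor-based models grow with $d^2$ per relation.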

VII. Conclusion and Future Work 

Knowledge embedding is an alternative way of representing knowledge besides displaying it in triplets, i.e. $(h, r, t)$. Its essence is to learn a distributed representation for each entity and relation, to make the knowledge computable, e.g. $h + r \approx t$.

TABLE VI. The scoring function and parameter complexity analysis for all the models mentioned in the experiments. For all the models, we assume that there are a total of $n_e$ entities and $n_r$ relations (in most cases, $n_e > n_r$), and each entity is embedded into a $d$-dimensional vector space, i.e. $h, t \in \mathbb{R}^d$. We also suppose that there are $s$ slices in a tensor for the neural-network related models, i.e. Single Layer Model and Neural Tensor Network.

Model                           Scoring Function                                                    Parameter Complexity
Unstructured                    $\|h - t\|$                                                         $n_e d$
Distance Model (SE)             $\|W_{rh} h - W_{rt} t\|$;                                          $n_e d + 2 n_r d^2$
                                $(W_{rh}, W_{rt}) \in \mathbb{R}^{d \times d}$
Single Layer Model              $u_r^T \tanh(W_{rh} h + W_{rt} t + b_r)$;                           $n_e d + 2 n_r (sd + s)$
                                $(W_{rh}, W_{rt}) \in \mathbb{R}^{s \times d}$, $(u_r, b_r) \in \mathbb{R}^s$
Bilinear Model                  $h^T W_r t$;                                                        $n_e d + n_r d^2$
                                $W_r \in \mathbb{R}^{d \times d}$
Neural Tensor Network (NTN)     $u_r^T \tanh(h^T W_r t + W_{rh} h + W_{rt} t + b_r)$;               $n_e d + n_r (sd^2 + 2sd + 2s)$
                                $W_r \in \mathbb{R}^{d \times d \times s}$, $(W_{rh}, W_{rt}) \in \mathbb{R}^{s \times d}$, $(u_r, b_r) \in \mathbb{R}^s$
TransE and LMNNE                $\|h + r - t\|$;                                                    $n_e d + n_r d$
                                $r \in \mathbb{R}^d$

To achieve higher-quality embeddings, we propose LMNNE, an effective and efficient model for learning a low-dimensional vector representation for each entity and relation in Knowledge Bases. Some canonical tasks, such as Link prediction and Triplet classification, which were once based on hand-crafted logic rules, can be truly facilitated by means of simple vector calculation. The results of extensive experiments on several benchmark datasets, together with the complexity analysis, show that our model can achieve higher performance without sacrificing efficiency.

In the future, we look forward to parallelizing the algorithm so that it can encode a whole KB with billions of facts, such as Freebase and YAGO. Another direction is to apply this new way of Knowledge Representation to reinforce other related studies, such as Relation Extraction from free text and Open Question Answering.